Abstract —This study describes the experimental application
of Machine Learning techniques to build prediction models that
can assess the injury risk associated with traffic accidents. This
work uses a freely available data set with records of the traffic
accidents that took place in the city of Porto Alegre/RS (Brazil)
during the year of 2013. This study also provides an analysis of
the most important attributes of a traffic accident that could
produce an outcome of injury to the people involved in the accident.

Index Terms —Injury risk assessment, classification, traffic 
accidents, machine learning. 

I. Introduction 

Statistics provided by EPTC [1], the traffic managing
agency in Porto Alegre/RS (Brazil), show that in 2013
approximately 22,447 traffic accidents took place in the city
of Porto Alegre, an average of 1,870 traffic accidents per
month. According to Saunier et al. [?], the social cost of
road collisions is the largest side effect of road transportation.
The costs of fatalities, injuries and property damage, as well
as medical care and traffic delays, account for a significant
impact on the finances of the people involved, cities and the
government. According to the Brazilian National Traffic
Department (DENATRAN) [?], the average cost of an accident
(on federal highways) without victims is R$ 16,840.00; for
accidents with victims this cost increases to R$ 86,032.00, and
for accidents with fatalities the cost is R$ 418,341.00.

Descriptive analyses of the road safety situation and of road
accidents are important, but understanding the factors related
to dangerous situations and the patterns in the data is of even
greater importance [?]. Being able to predict when a traffic
accident will result in an injury can help traffic agencies to
provide faster medical care. Understanding the factors behind
the injury risk can also guide traffic agencies in improving
road safety, either through infrastructure design (which
includes road signs and speed control devices) or through
improvements in pedestrian and driver behavior obtained with
targeted marketing campaigns. Data-driven decisions can also
help traffic agencies to reduce the costs involved in a traffic
accident.

This paper describes the efforts and experimental results
obtained through the application of Machine Learning techniques
in order to provide a better understanding of the data that is
being collected today by the traffic agency. The main purpose
of this work is to get an overall understanding of the accident
data and to build a predictor for the injury risk related to
traffic accidents in the city of Porto Alegre/RS (Brazil).

This study is organized as follows: Section II provides an
overview of the related work, followed by Section III, which
gives an overview of our proposed approach to the problem.
Section IV presents the experimental methodology, including
which algorithms were used and how they were used. Section V
details the results of the experimental analysis, and Section VI
presents the conclusions and remarks related to future work.


II. Related Work 

An attempt was made to search for existing accident
analysis practice in the city of Porto Alegre; however, no
published works were found related to the application of
Machine Learning techniques by the traffic managing agency
(EPTC) in Porto Alegre. Only limited descriptive analyses
were published on the site of the traffic agency [?], with
no deeper analysis of the contributing factors or injury risk
assessment. The lack of standardization of the data collection
process, and of the data itself, between different traffic
managing agencies worldwide makes any comparison of
experimental results very limited.

Beshah et al. [?] explored a rich data set comprising
14,254 accident cases described with 48 attributes containing
information related to road users (drivers, pedestrians and
passengers), vehicles and roads. In their study, two predictive
modeling methods were used: CART and Random Forests.
The CART analysis used to assess the injury risk scored
0.8827 with respect to the area under the ROC curve (AUC).
Using Random Forests, the authors also found that the age of
the victim and the victim occupation, among others, were the
attributes with the most predictive power.

Saunier et al. [?] investigated collision factors and
processes (i.e. the chain of events that lead to collisions)
through the collection and analysis of microscopic data (road
user trajectories). Saunier et al. [?] avoided the use of
algorithms with a “black box” nature like ANNs (Artificial
Neural Networks) or SVMs (Support Vector Machines) and
instead used C4.5 (Decision Trees) and clustering analysis
to investigate the collision factors. In their work, they found
a strong relationship between evasive actions and the
interaction outcome: in most collisions (62 out of 82), no
evasive action was attempted [?].

III. Proposed Approach 

The work described in this paper aims to evaluate different
Machine Learning techniques in order to build a predictive
model for injury risk assessment of traffic accident events
based on data that was collected by a traffic managing agency.
Since the dependent variable (injury or non-injury) is
dichotomous, the problem lends itself to binary classifiers
such as the Logistic Regression and Support Vector Machines
used in this study. This paper also evaluates the association
between the traffic accident injury outcome and the possible
contributory factors, in an effort to understand which factors
are the most important in an accident with an injury outcome.

IV. Experimental Methodology 

This section describes the data set used, as well as the tools 
and algorithms used to perform the analysis. 


A. Traffic Accident Data Set 

The data set used in this study was obtained through
Datapoa [?], an initiative from the city hall of Porto Alegre to
provide open data access to many data sets related to the city
itself. The traffic accident data set available at Datapoa is
licensed under the Open Database License (ODbL) [?], which
is an Attribution and Share-Alike license for databases.

Although the available traffic accident data sets span the
years 2000 to 2013, only the most recent data set was used
(the one covering the accidents that happened in 2013). The
data set comprises 20,798 accident records described using 44
attributes. Some attributes of the data set are irrelevant for
the purpose of this study, and many attributes also presented
duplicated data or invalid records, so a data cleansing step
was required before using the data set. The data set also
lacks detailed information about vehicles (e.g. age, movement),
drivers (e.g. age of the driver, driver license level, driving
experience, sex, etc.) and victims (e.g. age).


B. Tools 

To plot heat maps with the geospatial distribution of the
accidents, this study used the framework Django GIS Brasil
[?], an open source project by the same author as this study
that aggregates geospatial information related to the Brazilian
territory. For data analysis, the author used Pandas [?], an
open source library providing high-performance data
structures and data analysis tools for the Python programming
language. This study also used scikit-learn [?], an open
source Machine Learning framework for the Python language,
to perform data pre-processing and to build the predictive
models.


C. Machine Learning Techniques 

This study employed the following algorithms as classifiers
for the injury risk assessment: Logistic Regression, Support
Vector Machines, Naive Bayes, K-nearest neighbors and Random
Forests. The details about the use, parametrization and
model evaluation techniques used to assess the predictive
models are described in the next sections.


D. Logistic Regression 

The Logistic Regression used in this study is the
implementation present in the scikit-learn framework [?], which
in turn uses the LIBLINEAR [?] implementation of Logistic
Regression. The LIBLINEAR implementation solves
the following optimization problem:
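Given a set of instance-label pairs $(x_i, y_i)$, $i = 1, \ldots, l$,
with $x_i \in \mathbb{R}^n$ and $y_i \in \{-1, +1\}$,

$$\min_{w} \; \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi(w; x_i, y_i) \qquad (1)$$

where $C$ is the penalty parameter and $\xi(w; x_i, y_i)$ is the loss
function, which for Logistic Regression is:

$$\log\left(1 + e^{-y_i w^T x_i}\right) \qquad (2)$$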

In this study we used L2 regularized Logistic Regression with 
the penalty C equal to 1.0. 
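
As an illustration, the configuration described above (L2 penalty, C = 1.0, LIBLINEAR solver) could be set up in scikit-learn roughly as follows. This is a minimal sketch on synthetic data generated with `make_classification`, not the accident data of this study; the sample size, feature counts, class weights and random seeds are assumptions chosen only to make the example runnable. It also applies the 60%/40% train/test split used for model evaluation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the accident data (assumption: not the real data set).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8,
                           weights=[0.68, 0.32], random_state=0)

# 60% of the instances for training, 40% for testing, sampled at random.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

# Logistic Regression with C = 1.0 on the LIBLINEAR solver.
# The L2 penalty is scikit-learn's default, so it is not passed explicitly.
log_reg = LogisticRegression(C=1.0, solver="liblinear")
log_reg.fit(X_train, y_train)

# Probability estimate of the positive class for each test instance.
proba_positive = log_reg.predict_proba(X_test)[:, 1]
```

The positive-class probabilities are what a ROC curve is later computed from, so no extra calibration step is needed for this model.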

E. Support Vector Machines 

The Support Vector Machine (SVM) from scikit-learn [?]
used in this work is based on the LIBSVM [?] implementation,
which is a C-Support Vector Classification. For more
details about the algorithm implementation, please refer to the
original LIBSVM paper [?]. The SVM algorithm
was parametrized with a linear kernel and an error term (C)
of 9.0. Both parameters were chosen using hyperparameter
optimization through a non-exhaustive grid search over
different kernel types (RBF, Polynomial and Linear) with
different error term and gamma values. It is also important
to note that Support Vector Machine algorithms are not scale
invariant, so the author applied a scaling function to the
attributes before feeding them into the algorithm.
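
The scaling step and the grid search described above can be sketched with a scikit-learn pipeline. The data below is synthetic and the parameter grid is a hypothetical, deliberately small one, since the values actually searched in this study are not reported; only the selected configuration (linear kernel, error term 9.0) comes from the text. `StandardScaler` is also an assumption, as the text does not name the scaling function.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the study's data set).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Scaling lives inside the pipeline because SVMs are not scale invariant.
pipeline = make_pipeline(StandardScaler(), SVC())

# Hypothetical grid over kernel type, error term and gamma.
param_grid = {
    "svc__kernel": ["rbf", "poly", "linear"],
    "svc__C": [1.0, 9.0],
    "svc__gamma": ["scale", 0.1],
}
search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=3)
search.fit(X, y)
best_params = search.best_params_

# Configuration reported in the study: linear kernel, error term 9.0.
# probability=True asks LIBSVM to fit Platt scaling for predict_proba.
svm = make_pipeline(
    StandardScaler(),
    SVC(kernel="linear", C=9.0, probability=True, random_state=0),
)
svm.fit(X, y)
proba = svm.predict_proba(X)
```

Keeping the scaler inside the pipeline means each cross-validation fold is scaled using only its own training portion, which avoids leaking test statistics into the search.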


F. Naive Bayes 

The Naive Bayes algorithm used in this study is also
from scikit-learn. The different Naive Bayes classifiers
implemented in scikit-learn differ mainly in the assumptions
they make regarding the distribution of $P(x_i \mid y)$. The
author decided to use Gaussian Naive Bayes, where the
likelihood of the features is assumed to be Gaussian:

$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right)$$

where the parameters $\sigma_y$ and $\mu_y$ are estimated using
maximum likelihood.
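
To make the Gaussian assumption concrete, the sketch below fits scikit-learn's `GaussianNB` on a tiny synthetic data set (the values are illustrative assumptions) and checks that the fitted per-class means match the maximum likelihood estimate, i.e. the per-class sample mean. The helper `gaussian_likelihood` is a hypothetical name introduced here to evaluate the density by hand.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Tiny synthetic data set (assumption: illustrative values only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(3.0, 1.5, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

model = GaussianNB().fit(X, y)

# Maximum likelihood estimates: per-class mean and (biased) variance.
mu_manual = np.array([X[y == c].mean(axis=0) for c in (0, 1)])
sigma2_manual = np.array([X[y == c].var(axis=0) for c in (0, 1)])


def gaussian_likelihood(x, mu, sigma2):
    """Evaluate the Gaussian density with mean mu and variance sigma2 at x."""
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)


# Likelihood of the first feature of the first sample under class 0.
p = gaussian_likelihood(X[0, 0], mu_manual[0, 0], sigma2_manual[0, 0])
```

Only the means are compared with the fitted model here, because scikit-learn adds a small smoothing term to the variances for numerical stability.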


G. K-nearest neighbors 

The K-nearest neighbors (kNN) algorithm used in this study
is also from scikit-learn [?], which provides both unsupervised
and supervised neighbors-based learning methods. Despite the
simplicity of the algorithm, kNN has been successful in a large
number of classification and regression problems. The number
of neighbors used in this study (the k value) is 8. This
value was also found using a non-exhaustive hyperparameter
optimization through the grid search technique. Attribute
scaling was also performed before using kNN, to ensure that the
distance measure gives equal weight to each variable.
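
A minimal sketch of this setup, assuming `StandardScaler` as the scaling function (the text does not name one) and synthetic data whose two features live on very different scales:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with features on very different scales (assumption).
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(size=200), 1000.0 * rng.normal(size=200)])
y = (X[:, 0] > 0).astype(int)

# Scale the attributes, then classify with k = 8 neighbors.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=8))
knn.fit(X, y)

# Each probability is the fraction of the 8 neighbors voting for a class,
# so every value is a multiple of 1/8.
proba = knn.predict_proba(X)
```

Without the scaling step, the second feature would dominate the Euclidean distance purely because of its magnitude, even though only the first feature carries the class signal in this toy example.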
H. Random Forests

The author also used Random Forests from scikit-learn [?],
both as a binary classifier and to evaluate feature importance,
in order to understand which factors are the most important
when predicting the injury risk. Random forests are a
combination of tree predictors, where each tree depends on the
values of a random vector sampled independently and with the
same distribution for all trees in the forest. In contrast to
the original publication by Breiman [?], the scikit-learn
implementation combines classifiers by averaging their
probabilistic predictions, instead of letting each classifier
vote for a single class [?]. Random Forests were also used to
assess injury risk and the importance of factors in the
aforementioned study by Beshah et al. [?].

The number of estimators (trees in the forest) used was 200;
this number was chosen using non-exhaustive hyperparameter
optimization through grid search.
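
The averaging behavior and the importance scores mentioned above can be checked directly. The sketch below uses synthetic data (an assumption, as are the seeds and feature counts) with the 200 estimators reported in this study; the importances shown are scikit-learn's impurity-based `feature_importances_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: not the study's data set).
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X, y)

# The forest probability is the mean of the per-tree probabilities.
forest_proba = forest.predict_proba(X)
mean_tree_proba = np.mean(
    [tree.predict_proba(X) for tree in forest.estimators_], axis=0)

# Impurity-based importances, one per feature, ranked from high to low.
importances = forest.feature_importances_
ranking = np.argsort(importances)[::-1]
```

Because the importances are normalized to sum to one, they can be read as relative shares, which is how a ranked importance table would be interpreted.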


I. Model Evaluation

In order to evaluate the predictive models trained in this
study, the author used hold-out cross-validation, with a
training data set containing 60% of the instances from the
original data set and a test data set containing the remaining
40%. Both the training and the test data sets were randomly
sampled from the original data set.

To evaluate the predictive models, the author used the Area
Under the Curve (AUC), computed using the trapezoidal
rule, of the Receiver Operating Characteristic (ROC), which
is a graphical plot that shows the performance of a binary
classifier as its discrimination threshold is varied. The curve
in this study was plotted using the true positive rate against
the false positive rate at various threshold settings (one for
each different predictive outcome from each model).

Complementary to the ROC and AUC, the author also
calculated tables presenting the Precision, Recall and
F1-score for each class from each predictive model used.

Since some algorithms used in this study do not produce a
natural probability estimate for each class as Logistic
Regression does, the author used different estimation
techniques in order to be able to compare the ROC and AUC
between different classifiers:

1) Support Vector Machines: In this case, the probability
estimates were calculated by LIBSVM using Platt scaling [?].

2) K-nearest neighbors: For kNN, the predicted probability
for each class is the fraction of neighbors voting for that
label, i.e. if k = 5 and 4 neighbors predicted class 1 and only
one neighbor predicted class 0, then the probabilities for that
example are 0.2 for class 0 and 0.8 for class 1.

3) Random Forest: The probabilities of a forest are the
mean probabilities of the trees in the ensemble, and the
probabilities returned by a single tree are the normalized class
histograms of the leaf that a sample lands in.

J. Data Cleansing

Some accident records present in the data set had an invalid
date/time format. Since the number of invalid records was
very low (fewer than 5) when compared with the number of valid
records (greater than 20,790), these records were simply removed
from the data set without causing any significant loss in the
experimental analysis.

V. Experimental Analysis

This section provides a brief descriptive analysis of the data
set used, as well as the experimental results obtained using
different predictive models, together with their model
evaluations.

A. Data set analysis

The data set (after applying data cleansing) comprises
20,798 records of traffic accident events that took place in
the city of Porto Alegre/RS (Brazil). The attributes of the data
set can be categorized in five different types:

1) Geospatial Attributes: These attributes, listed in Table I,
represent where the accident happened in space. They were not
used in this study and were left for a further study.


TABLE I

Geospatial Attributes

Attribute Name            Description
LOG1 and LOG2             Street names.
PREDIAL1                  Street numbers.
REGION                    Region of the city.
LATITUDE and LONGITUDE    The geographical coordinates.
LOCAL_VIA and REGION      Concatenation of LOG1, LOG2 and PREDIAL1.

2) Irrelevant Attributes: These attributes are irrelevant to
the analysis of factors or injury risk assessment. They are
presented in Table II.


TABLE II

Irrelevant Attributes

Attribute Name    Description
ID                The unique ID of the accident.
BOLETIM           The ID of the traffic agency record.

3) Attributes with data leakage: Since the main goal of
this study is to predict the risk of injury/non-injury, extra
care was taken to discover attributes that could leak
information about the target class. The result of this
evaluation is presented in Table III. The attribute “FONTE”
leaks information about the injury target class because the
police are usually involved only when someone was injured. The
attribute “UPS” also leaks information about the target class
because it assumes 3 different values: 1 (accident with
property damage only), 5 (accident with someone injured) and
13 (accident with deaths), so a UPS value of 5 or 13 perfectly
predicts the injury class and a value of 1 the non-injury class.
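
The leakage through “UPS” can be illustrated with a few toy records. The values below are invented for the example, and the “FONTE” category labels (`AGENCY`, `POLICE`) are hypothetical placeholders; only the attribute names and the UPS codes 1, 5 and 13 come from the text.

```python
import pandas as pd

# Toy records (assumption: invented values, hypothetical FONTE labels).
df = pd.DataFrame({
    "UPS":     [1, 5, 13, 1, 5],
    "FONTE":   ["AGENCY", "POLICE", "POLICE", "AGENCY", "POLICE"],
    "AUTO":    [2, 1, 1, 2, 0],
    "FERIDOS": [0, 1, 0, 0, 2],
    "FATAIS":  [0, 0, 1, 0, 0],
})

# Injury target: 1 if anyone was injured or killed, 0 otherwise.
target = ((df["FERIDOS"] + df["FATAIS"]) > 0).astype(int)

# A rule on UPS alone reproduces the target, which is the leak.
ups_rule = df["UPS"].isin([5, 13]).astype(int)
leaks = bool((ups_rule == target).all())

# Drop the leaking attributes (and the victim counts that define the
# target) before building the feature matrix.
features = df.drop(columns=["UPS", "FONTE", "FERIDOS", "FATAIS"])
```

A model trained with such an attribute would score almost perfectly on the test set while being useless in practice, since the severity code is only known after the outcome.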

4) Relevant attributes: These are the attributes that were
used to train all the predictive models presented in this study.
They are shown in Table IV. Except for the counting attributes,
all attributes were preprocessed using a one-hot encoding scheme
(also known as the one-of-K scheme).
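
One possible way to apply this encoding with Pandas is sketched below. The records are toy values; the category labels are illustrative assumptions (in particular `NOITE` is assumed as the counterpart of `DIA`), and `pd.get_dummies` is one of several ways to do this, not necessarily the one used in this study.

```python
import pandas as pd

# Toy records (assumption: the category values are illustrative only).
df = pd.DataFrame({
    "LOCAL":     ["CRUZAMENTO", "LOGRADOURO", "CRUZAMENTO"],
    "NOITE_DIA": ["DIA", "NOITE", "DIA"],
    "AUTO":      [2, 1, 0],
    "MOTO":      [0, 1, 1],
})

# Categorical attributes get one binary column per category value;
# the counting attributes (AUTO, MOTO) are kept as plain numbers.
categorical = ["LOCAL", "NOITE_DIA"]
encoded = pd.get_dummies(df, columns=categorical, dtype=int)
```

Each categorical attribute is expanded into as many indicator columns as it has distinct values, which is why a single source attribute can appear several times in a feature ranking.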
TABLE III

Attributes with data leakage

Attribute Name    Description
FONTE             Whether the accident was registered by the traffic agency or by the police.
UPS               A severity measurement.


TABLE IV

Relevant Attributes

Attribute Name    Description
LOCAL             Whether the accident happened on a street or at a crossing.
TIPO_ACID         The type of the accident (collision, fire, etc.).
DIA_SEM           The day of the week.
CONSORCIO         If a bus was involved, the name of the company.
AUTO              The count of cars involved.
TAXI              The count of cabs involved.
LOTACAO           The count of small buses involved.
ONIBUS_URB        The count of urban buses involved.
ONIBUS_MET        The count of buses (others) involved.
CAMINHAO          The count of trucks involved.
MOTO              The count of motorcycles involved.
CARROCA           The count of carts involved.
BICICLETA         The count of bikes involved.
OUTRO             The count of vehicles (others) involved.
TEMPO             The weather conditions (raining, clear, etc.).
NOITE_DIA         Whether it was night or day.
MES               The month of the accident.
FX_HORA           The hour at which the accident happened.
CORREDOR          Whether the accident happened in the bus lane or not.


Fig. 1. Heat map of the traffic accidents.


B. Model evaluation

The model evaluation results (Precision, Recall, F1-score)
on the test data set for Logistic Regression and Support Vector
Machine (SVM) are shown in Table VI, the results for
Naive Bayes and K-nearest neighbors are shown in Table VII,
and the results for the Random Forest model are shown in
Table VIII. The ROC curves and the AUC values of each model
for the positive class (injury) prediction with varying
discrimination threshold are presented in Figure 2.

5) Target attribute: Since the aim of this work is to predict
whether the outcome of a traffic accident was an injury or a
non-injury, the author summed the attributes shown in Table V
to create a new attribute, which was then converted to 0
(non-injury) if the sum was less than or equal to zero, or to
1 (injury) if the sum was greater than or equal to 1.
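
The construction of the target can be sketched with Pandas as follows. The records are toy values, and the name of the new column (`INJURY`) is a hypothetical choice made for this example; the five source attributes are the victim-count attributes of the data set.

```python
import pandas as pd

# Toy records (assumption: invented counts for illustration).
df = pd.DataFrame({
    "FERIDOS":     [0, 2, 0, 1],
    "FERIDOS_GR":  [0, 1, 0, 0],
    "MORTES":      [0, 0, 1, 0],
    "MORTES_POST": [0, 0, 0, 0],
    "FATAIS":      [0, 0, 1, 0],
})

victim_columns = ["FERIDOS", "FERIDOS_GR", "MORTES", "MORTES_POST", "FATAIS"]

# Sum the victim counts per accident, then binarize:
# 0 (non-injury) when the sum is zero, 1 (injury) when it is at least 1.
victim_sum = df[victim_columns].sum(axis=1)
df["INJURY"] = (victim_sum >= 1).astype(int)
```

Since the summed attributes define the target, they have to be excluded from the features used for training, for the same reason as the leaking attributes.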

TABLE V

Attributes merged to create the target attribute

Attribute Name    Description
FERIDOS           The count of injured people involved in the accident.
FERIDOS_GR        The count of seriously injured people involved in the accident.
MORTES            The count of deaths (local deaths) in the accident.
MORTES_POST       The count of deaths (posterior deaths) that happened after the accident.
FATAIS            The sum of the MORTES and MORTES_POST attributes.


TABLE VI

Support Vector Machine and Logistic Regression evaluation

              Support Vector Machine            Logistic Regression
              Precision  Recall  F1-score       Precision  Recall  F1-score
Non-injury    0.90       0.96    0.93           0.90       0.96    0.93
Injury        0.89       0.76    0.82           0.89       0.76    0.82
Average       0.90       0.90    0.89           0.89       0.90    0.89


As we can see, the Logistic Regression and Support Vector
Machine models provided the best scores in AUC and in
average Precision/Recall/F1-score, and their performance was
very similar. Random Forest also performed well, with an AUC
of 0.93 compared with the AUC of 0.94 scored by both SVM
and Logistic Regression. K-nearest neighbors performed below
SVM, Logistic Regression and Random Forest, with an AUC of
0.90, but it still performed better than the worst model,
Naive Bayes, with an AUC of 0.83.
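
For reference, the kind of scores discussed here (per-class Precision, Recall and F1-score, and the AUC of the ROC curve) can be computed with scikit-learn as sketched below. The labels and scores are hypothetical values made up for the example, not results from this study, and the 0.5 decision threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_fscore_support, roc_curve

# Hypothetical labels and scores (assumption: not results from the study).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.3, 0.2, 0.6, 0.8, 0.7, 0.4, 0.2, 0.9, 0.5])

# ROC curve: true positive rate against false positive rate per threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)

# The same area, written out explicitly with the trapezoidal rule.
manual_auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Per-class Precision, Recall and F1-score at a 0.5 decision threshold.
y_pred = (y_score >= 0.5).astype(int)
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=[0, 1])
```

The AUC is threshold-free, while Precision, Recall and F1-score depend on the chosen threshold, which is why the two kinds of score complement each other.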


It is also important to note that the data set is imbalanced:
the ratio between the target classes (non-injury/injury) is at
least 2:1, with 14,247 non-injury instances and 6,551 injury
instances.
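
The arithmetic behind the imbalance is short enough to write out. The sketch below only uses the two class counts given above; the "majority baseline" is an illustrative quantity added here, namely the accuracy of a trivial classifier that always predicts non-injury.

```python
# Class counts reported for the data set.
non_injury = 14247
injury = 6551

total = non_injury + injury

# Ratio between the classes (about 2.17 non-injury records per injury record).
ratio = non_injury / injury

# Accuracy of always predicting the majority class (about 0.685).
majority_baseline = non_injury / total
```

Because a trivial classifier already reaches roughly 68.5% accuracy on such data, per-class scores and the AUC are more informative than plain accuracy.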

The geospatial information related to the accident events
was not used in this study, but the heat map shown in Figure 1
shows an important pattern that clearly confirms that the
accident density increases at street crossings. This information
is represented not only in the latitude/longitude attributes but
also in the “LOCAL” attribute used to train the predictive
models in this study.


TABLE VII

Naive Bayes and K-nearest neighbors evaluation

              Naive Bayes                       K-nearest neighbors
              Precision  Recall  F1-score       Precision  Recall  F1-score
Non-injury    0.96       0.23    0.38           0.85       0.96    0.90
Injury        0.37       0.98    0.54           0.88       0.63    0.73
Average       0.78       0.47    0.43           0.86       0.86    0.85
TABLE VIII

Random Forest evaluation

              Random Forest
              Precision  Recall  F1-score
Non-injury    0.90       0.94    0.92
Injury        0.85       0.76    0.80
Average       0.88       0.88    0.88


Fig. 2. The Receiver Operating Characteristic (ROC) and Area Under Curve
(AUC) of each model for the positive class (injury).


C. Variable Importance

According to Strobl et al. [?], Random Forests have been
successfully applied to various problems and, within a very
short period of time, have become a major data analysis tool
that performs well in comparison with many standard methods.
One of the factors that greatly contributed to the popularity
of Random Forests is that they produce variable importance
measures for each predictor variable.

The attributes that had the most importance when predicting
the injury risk were obtained using the same trained model
whose evaluation is presented in Table VIII. The 10 most
important attributes according to this model, with their
respective importances, are presented in Table IX.

As we can see in Table IX, the motorcycle count attribute
had the largest importance when predicting the injury risk,
followed by other attributes, such as the count of cars
involved and whether the accident was a run over, among others.


TABLE IX

Attribute Importances

Importance    Attribute Name              Description
0.2108        MOTO                        The count of motorcycles involved.
0.0948        AUTO                        The count of cars involved.
0.0925        TIPO_ACID (ATROPELAMENTO)   If the type of the accident was a run over.
0.0391        LOCAL (LOGRADOURO)          If the accident was on a normal street.
0.0368        LOCAL (CRUZAMENTO)          If the accident was on crossing streets.
0.0267        TIPO_ACID (COLISAO)         If the type of the accident was a collision.
0.0205        TIPO_ACID (QUEDA)           If the type of the accident was a fall.
0.0203        CAMINHAO                    The count of trucks involved.
0.0182        TIPO_ACID (ABALROAMENTO)    If the type of the accident was a collision (on the side).
0.0181        NOITE_DIA (DIA)             If the accident happened during night time.


VI. Conclusions and Future Work

As we can see, the experimental results demonstrated that
prediction models for injury risk assessment can be created
with good precision, even with limited data sets like the one
used in this study, which lacks information about vehicle
drivers, victims and vehicle movements. These results, together
with the variable importance analysis, can be used by traffic
managing agencies to understand the provided data sets in
even greater depth than the limited descriptive analysis that
is being carried out today by these agencies.


This study did not use the geospatial data, but the author
believes that this information is also a critical factor in the
prediction of the injury risk associated with a traffic accident.
The use of the geospatial data was left to a future study
due to the very specific nature of the geospatial data format,
which requires a different preprocessing approach before being
employed.

Future work can also include better hyperparameter
optimization, with a more intensive search for better parameters
such as the kNN neighbor count, the SVM error term, the SVM
kernel parameters and the Random Forest estimator count, among
others. This study also did not apply feature selection
techniques, but the author believes that future work could also
improve the models' performance by using feature selection
methods.